Optimized bytecode interpreter of virtual machine instructions



FIELD OF THE INVENTION 

The invention relates to run-time optimization of interpreted programs. It 
relates, more particularly, to a method for optimizing interpreted programs by means of a 
virtual machine which dynamically reconfigures itself with new macro operation codes. The 
invention applies to any bytecode-based programming language.

BACKGROUND OF THE INVENTION 

Bytecode-based languages with programmer-visible stacks are popular as
intermediate languages for compilers, and also as machine-independent executable program
representations. They offer significant advantages for network computing. The article
"Optimizing direct threaded code by selective in-lining", by I. Piumarta and F. Riccardi, in
Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and
Implementation (PLDI), Montreal, Canada, June 17, 1998, pp. 291-300, describes a technique,
as mentioned in the opening paragraph, for optimizing interpreted programs. A virtual
machine (VM) is used to interpret the programs by means of a VM interpreter. The VM is a
software implementation representing an architecture of a virtual processor on which
applications especially compiled for this architecture are executed. The instructions of the
virtual processor / machine are called bytecodes. The VM interpreter is the part of the VM
which represents the bytecodes' execution mechanism. The bytecodes are said to be
interpreted by the VM interpreter. The bytecodes' execution mechanism is currently
implemented as an infinite loop with a switch case block. The technique described in the cited
article applies to direct threaded interpreters. Threaded code interpreters execute the
bytecodes in line. Each bytecode translation contains the reference to the next bytecode.
Therefore, the bytecode translation as executed by a threaded interpreter does not involve the
infinite loop. Even though threaded interpreters offer a performance advantage, they are too
slow and require too much memory to be convenient for most embedded systems. In a direct
threaded code interpreter, as described in the cited article, the VM bytecodes are represented
with the address of their implementation, so that each bytecode can directly jump to the
implementation of the next bytecode. A table is initialized before the translation operation
with the addresses of each bytecode of the application so that, when the bytecode
translation takes place, the physical addresses of the bytecode implementations are
quickly accessible. The table makes it possible to switch from one bytecode to another. Direct
threaded interpreters are rather fast but they involve code expansion. By changing bytecodes
into direct threaded codes, the code size is increased by approximately 150%, because the
operation codes are replaced with the addresses of their implementation code. In general,
addresses need 4 bytes whereas the operation codes need only 1 byte. Therefore, direct
threaded interpreters increase memory consumption and are thus not very suitable for
embedded systems.



SUMMARY OF THE INVENTION 

It is an object of the invention to provide a method for optimizing the run-time of
interpreted programs which is very convenient for embedded systems. Such systems may be,
for example, satellite or cable transmission systems embedded into a digital video receiver,
often called a set top box. But the invention also applies to any product whose operating
system is based on a bytecode-based programming language. The invention also saves
memory and CPU resources and can improve the performance of the system.

In accordance with the invention, a method is described of optimizing
interpreted programs in a virtual machine interpreter of a bytecode-based language, wherein
the virtual machine dynamically reconfigures itself with new macro bytecodes (or opcodes)
replacing sequences of simple bytecodes, and wherein the virtual machine interpreter is
coded as a threaded code interpreter for translating the bytecodes into their implementation
codes. The threaded code interpreter according to the invention is coded as an indirect
threaded code interpreter by means of a reference table which contains the implementation
addresses of the bytecodes, so that during translation of a bytecode, the address of the
next bytecode's implementation is retrieved in order to jump to the next bytecode.

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention and additional features, which may be optionally used to 
implement the invention, are apparent from and will be elucidated with reference to the 
drawings described hereinafter. 

Fig. 1 is a block diagram illustrating the features of a method according to the
invention.

Fig. 2 is a block diagram illustrating the features of a method according to the
preferred embodiment of the invention.

Fig. 3 is a schematic diagram illustrating an example of a receiver according to
the invention.

DETAILED DESCRIPTION OF THE INVENTION 

The invention will be explained in greater detail, taking the Java language
as an example, to illustrate a novel run-time optimization strategy applicable to any
bytecode-based language. [...]
altogether the Java virtual machine (VM) interpreter and to translate the application's
bytecode into native machine code prior to its execution (hence the Just-In-Time [...])
[...] expressing it in a more convenient native form. While this may be an efficient way of
[...] one hand, because a bytecode-based language is more compact than native code, and of
the CPU (Central Processing Unit) resources on the other hand, because re-mapping Java
bytecodes on the target machine is not an easy task.

The invention is also based on some sort of dynamic code generation, but its
goal is not that of translating the application's Java bytecode into native machine code, but
rather to dynamically adapt the Java VM to the execution of the application's specific
bytecode sequences. The original application's Java bytecode is thus preserved, while the VM
is dynamically enriched with novel bytecodes or operation codes (opcodes) improving its
execution efficiency.

There are several advantages to this approach :
It does not increase the size of the executable code : the application is left in Java's
memory-efficient bytecoded representation,

The VM's execution mechanism is economical : there is only one execution
mechanism, therefore the VM executing the application will not have to deal with multiple
code representations, which contributes to reducing its size and improving its reliability,

The code generation technique is rather simple : the VM optimizer has a very
simple structure, and the application's bytecode analysis is a one-pass table-driven procedure
taking very little CPU resources, which directly drives the synthesis of new bytecodes.
These properties make the invention suitable for embedded applications. The
foundation of the optimization technique according to the invention lies in the study of the
costs of the very basic mechanisms of an interpreter with respect to a category of "typical"
applications. The relevance of the application's profile lies in the potential benefit attainable
from the various optimization techniques that might be envisaged. Since the target is
embedded applications, what might be defined as "typical" applications are, for example,
control applications, graphical user interfaces, and so forth.

It is assumed that the target applications are well mapped on the primitives
offered by the underlying VM (object manipulations). Therefore, they will not benefit much
from radical code transformations, but rather from a general improvement of the VM's
execution mechanisms. To understand how to improve the efficiency of the VM, use was
made of Amdahl's law. In the version stated by Hennessy and Patterson, Amdahl's law is
expressed as follows : "the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of time the faster mode can be used", or more
concisely : "make the common case fast".
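For reference, the usual quantitative form of Amdahl's law, which the text above only quotes in words, can be written as follows. The symbols are introduced here for illustration: f is the fraction of execution time that can use the faster mode, s is the speedup of that mode, and S is the resulting overall speedup.

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{s}}
\qquad\text{e.g. } f = 0.6,\; s = 2
\;\Rightarrow\;
S = \frac{1}{0.4 + 0.3} \approx 1.43
```

The value f = 0.6 is an assumed figure chosen only to make the arithmetic concrete; it is not a measurement from this document. The formula shows why reducing the dispatch cost only pays off when dispatching accounts for a large share of the run time.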

An interpreter's performance depends on the representation chosen for executable
code and on the mechanism used to dispatch the bytecodes. The first approach to reduce the
implementation cost was to reduce the cost of instruction dispatching, because the heart of an
interpreter is its instruction dispatching mechanism. The typical interpreter, called a pure
bytecode interpreter, is implemented like a processor simulation : a large switch statement
sitting in an endless loop, dispatching instructions to their implementations. Therefore, the
inner loop of a pure bytecode interpreter is very simple : fetch the next bytecode and dispatch
to the implementation using a switch statement. The interpreter is an infinite loop containing
a switch statement to dispatch successive bytecodes, and passes control to the next bytecode
by breaking out of the switch to pass control back to the start of the infinite loop. The
following set of instructions illustrates an implementation of a typical bytecode interpreter.



    for (;;) {
        op = *pc++;
        switch (op) {
        case op_1:
            // op_1's implementation
            break;
        case op_2:
            // op_2's implementation
            break;
        case op_3:
            // op_3's implementation
            break;
        }
    }

Assuming the compiler optimizes the jump chains from the breaks through the
implicit jump at the end of the loop back to its beginning, the overheads associated with this
approach are as follows :

increment the instruction pointer pc,
fetch the next bytecode from memory,
a redundant range check on the argument to switch,
fetch the address of the destination case label from a table,
jump to that address,

and at the end of each bytecode :

jump back to the start of the loop to fetch the next bytecode.

In this case the cost of instruction dispatching, ignoring all other sources of
inefficiency such as the actual implementation of the switch statement, consists of :

2 memory accesses : one to retrieve the value of the next instruction, one to
retrieve the address of the instruction's implementation,

plus 2 branches : one to jump to the bytecode's implementation and another
one to go back to the beginning of the loop. Jumps are among the most expensive instructions
on modern architectures.

Pure bytecode interpreters are easy to write and to understand. They are also
highly portable but rather slow. They are thus not convenient for embedded systems. In the
case where most bytecodes perform simple operations, as in the example illustrated
hereinbefore, most of the execution time is wasted in performing the dispatch. Actually, in
order to be aware of the real cost of the mechanism, it should be compared with the cost of the
execution of a single bytecode. Java bytecodes have very low-level semantics, and their
implementation is often trivial. Therefore, the most commonly executed bytecodes are
actually less expensive than the dispatching mechanism itself.
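The pure bytecode interpreter discussed above can be sketched as compilable C. This is a minimal illustration, not the implementation of any particular VM: the opcodes OP_PUSH1, OP_ADD and OP_HALT, the function name run_switch and the dispatch counter are all hypothetical and exist only to make the loop-and-switch structure and its per-bytecode overhead visible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration; not part of any real VM. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Runs `code` until OP_HALT and returns the top of the stack.
 * *dispatches is incremented once per fetch/switch round trip. */
int run_switch(const uint8_t *code, long *dispatches)
{
    int stack[16];
    int sp = 0;                     /* index of the next free slot */
    const uint8_t *pc = code;

    for (;;) {
        uint8_t op = *pc++;         /* fetch the bytecode, increment pc */
        ++*dispatches;
        switch (op) {               /* range check, table lookup, jump */
        case OP_PUSH1:
            assert(sp < 16);
            stack[sp++] = 1;
            break;                  /* jump back to the start of the loop */
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            return -1;              /* unknown opcode */
        }
    }
}
```

In this sketch the work done by OP_PUSH1 or OP_ADD is a single store or addition, so the fetch, switch and loop-back around it are at least as costly as the bytecode itself, which is the point the text makes.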

A first improvement in efficiency according to the invention is the adoption of
indirect threaded code as illustrated with the set of instructions below :

    op_1_lbl:
        // op_1's implementation
        goto *opcode_table[*pc++];   // computed goto (GNU C extension)

    op_2_lbl:
        // op_2's implementation
        goto *opcode_table[*pc++];

    op_3_lbl:
        // op_3's implementation
        goto *opcode_table[*pc++];

where op_1_lbl, op_2_lbl and op_3_lbl represent 3 different operation codes interpreted
by the VM interpreter.

According to this implementation, called indirect threaded code, the VM is
coded as an indirect threaded code interpreter. During bytecode translation, the address of the
next bytecode is resolved. A reference table, denoted opcode_table, contains the bytecodes'
implementation addresses. The reference table is indexed by the value read through the
program counter (*pc++). For each bytecode translation, the address of the next bytecode's
implementation is retrieved in order to jump to it. In this way each bytecode implementation
jumps directly to the next bytecode implementation, which saves one branch (the outer loop)
and the unnecessary inefficiency of the switch statement's implementation (range checking
and default case handling).
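As a hedged illustration of the indirect threaded dispatch just described, the sketch below uses the "labels as values" extension of GCC and Clang, so it is not standard C. The opcodes, the stack size and the function name run_threaded are assumptions made for the example; only the reference table opcode_table and the dispatch through *pc++ come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration. */
enum { OP_BIPUSH, OP_ADD, OP_HALT, OP_COUNT };

/* Indirect threaded interpreter sketch (needs GCC or Clang). */
int run_threaded(const uint8_t *code)
{
    /* Reference table: opcode -> address of its implementation. */
    static void *const opcode_table[OP_COUNT] = {
        [OP_BIPUSH] = &&op_bipush_lbl,
        [OP_ADD]    = &&op_add_lbl,
        [OP_HALT]   = &&op_halt_lbl,
    };
    int stack[16];
    int sp = 0;                         /* index of the next free slot */
    const uint8_t *pc = code;

    goto *opcode_table[*pc++];          /* initial dispatch */

op_bipush_lbl:
    assert(sp < 16);
    stack[sp++] = (int8_t)*pc++;        /* operand is one signed byte */
    goto *opcode_table[*pc++];          /* jump straight to the next bytecode */

op_add_lbl:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    goto *opcode_table[*pc++];

op_halt_lbl:
    assert(sp >= 1);
    return stack[sp - 1];
}
```

There is no outer loop and no range check on the opcode in this sketch, so it relies on the bytecode being valid. The program itself stays one byte per opcode, in contrast with direct threading where each opcode is replaced by a full address.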

According to a preferred embodiment of the invention, the translation is 
carried out by exploiting unused bytecodes of the bytecode-based language VM specification. 

The block diagram of figure 1 summarizes the main steps of the method
according to the invention for translating a bytecode, e.g. the bytecode bipush, into native
instructions with an indirect threaded code interpreter :

step K0 = BIPUSH ; beginning of the method of translating the bytecode
bipush, which consists of putting a word on a stack, this word being the bipush parameter
(par)

step K1 = PAR ; retrieve the bipush parameter (par)

step K2 = PUT ; put the bipush parameter on the stack

step K3 = GOTO ; go to the next bytecode (goto opcode_table(*pc)) by
looking into a reference table opcode_table containing the address of the next bytecode's
implementation.

The adoption of threaded code by itself can double the VM's performance, but
as will be seen in the following, it can also offer other interesting optimization opportunities. A
statistical analysis of Java's bytecodes shows that, on average, there is a branch about every
5-6 instructions. On any modern CPU, branches are intrinsically expensive instructions,
since they can cause pipeline stalls and/or trigger external bus activity. Apart from loop
unrolling or method call in-lining, there is not much that can really be done about it. Even
when recompiling the code into a native representation, the control statements will still be
there.

Recent studies on the CPU utilization for object oriented applications on high-end
workstations show that the CPU can spend as much as 70% of its clock cycles recovering
from pipeline stalls, as the effect of mispredicted branch statements, and waiting for data and
instructions from main memory (cache misses). Additionally, CPUs available in embedded
systems normally have very small caches, no hardware assistance for dynamic branch
prediction, and slow and/or narrow memory interfaces with no L2 caches. These additional
constraints will reduce the CPU utilization and performance even further.

Java bytecodes can be separated into two categories :
simple operation codes (loads, stores, arithmetic and control statements) and
complex operation codes (memory management, synchronization, etc.).

Simple bytecodes are typically less expensive than the dispatching
mechanism. Complex bytecodes are instead much more expensive, the dispatching cost
representing only a minimal fraction of the total bytecode execution cost. Simple
bytecodes are also executed much more frequently (about an order of magnitude) than
complex ones, implying that a classical Java interpreter spends most of its time dispatching
bytecodes rather than really doing anything useful. It is thus assumed that it would be
definitely more effective to reduce the dispatching cost for simple bytecodes than for
complex ones.

Translating bytecodes into indirect threaded code also gives the opportunity to
make arbitrary transformations on the executable code. One such transformation is to detect
common sequences of bytecodes and translate them into a single threaded "macro code".
This macro code performs the work of the entire sequence of original bytecodes. Therefore,
according to a preferred embodiment of the invention, it is proposed to replace sequences of
simple bytecodes by some equivalent "macro codes". For example, as presented in the cited
article, the bytecodes "push literal, push variable, add, store variable" can be translated into a
single "add-literal-to-variable" macro code in the indirect threaded code. Such optimizations
are effective because they avoid the overhead of the multiple dispatches that are implied by
the original bytecodes, but elided within the macro code. A single macro code which is
translated from a sequence of N original bytecodes avoids N-1 bytecode dispatches at
execution time. More details about how to build macro codes can be found in the cited
article. Such macro codes will have to satisfy the following criteria :

Macros have to be made out of sequences of simple bytecodes, since there is
no point in reducing the dispatching cost of complex ones.

Macros must not contain instructions that are possible branch targets,
otherwise one would have to radically change the VM execution mechanism. A macro itself
can be a branch target.

Macros must terminate with control statements or method calls, since the cost
of a native branch is equivalent to that of a dispatch operation.

For implementation simplicity, the maximal length of a macro should be
approximately 15 bytecodes, the "natural" average macro length being 4-5 bytecodes.
From these criteria it is very simple to construct such macro sequences, taking very little (and
bounded) CPU time. A simple scan of a method's bytecode is indeed enough, and most of the
parsing can be table driven and single-bytecode based.
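The criteria above can be turned into a one-pass scan, sketched below under several simplifying assumptions that are not taken from the text: every instruction is one byte long (no operands), the classification of opcodes into simple, control and complex is a made-up numeric rule, branch targets are supplied as a precomputed flag array, and a macro must contain at least two bytecodes to be worth creating. The sketch also adopts one possible reading of the third criterion: a control statement closes the macro as its last bytecode, while a complex bytecode (which would include method calls) ends the macro before it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MACRO_MAX_LEN = 15, MACRO_MIN_LEN = 2 };   /* MIN is an assumption */

typedef enum { KIND_SIMPLE, KIND_CONTROL, KIND_COMPLEX } op_kind;
typedef struct { size_t start; size_t len; } macro_span;

/* Hypothetical classification: opcodes 0..9 are simple, 10..19 are
 * control statements, everything else is complex. */
static op_kind classify(uint8_t op)
{
    if (op < 10) return KIND_SIMPLE;
    if (op < 20) return KIND_CONTROL;
    return KIND_COMPLEX;
}

/* Single pass over `code`; stores at most `max_out` candidate macros
 * in `out` and returns how many were found. */
size_t find_macros(const uint8_t *code, size_t n,
                   const bool *is_branch_target,
                   macro_span *out, size_t max_out)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n && count < max_out) {
        if (classify(code[i]) == KIND_COMPLEX) { i++; continue; }

        size_t start = i;
        size_t len = 0;
        while (i < n && len < MACRO_MAX_LEN) {
            op_kind k = classify(code[i]);
            if (k == KIND_COMPLEX) break;
            /* A branch target may start a macro but not sit inside one. */
            if (len > 0 && is_branch_target[i]) break;
            i++;
            len++;
            if (k == KIND_CONTROL) break;   /* control statement ends the macro */
        }
        if (len >= MACRO_MIN_LEN) {
            out[count].start = start;
            out[count].len = len;
            count++;
        }
    }
    return count;
}
```

Each bytecode is examined once and the work per bytecode is a table-style classification, which matches the bounded, table-driven cost the text describes.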

According to a particular alternative of the preferred embodiment, which takes
into account that unused bytecodes are very few (30-40 on average), a two-byte representation
can be used for the new bytecodes representing the new macro-instruction. The operands of
the original sequence are grouped right after the new sequence, which leaves them easily
accessible by incrementing the program counter of the virtual machine.
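One way to picture the two-byte representation is sketched below. Everything concrete in it is an assumption made for illustration: the escape value 0xFE standing for one unused opcode, the four example opcodes, their operand counts, and the function name encode_macro. The sketch only shows the layout described above, namely a two-byte macro code followed by the operands of the original sequence in their original order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values: one unused opcode serves as an escape prefix. */
enum { OP_PUSH_LIT = 0, OP_PUSH_VAR = 1, OP_ADD = 2, OP_STORE_VAR = 3,
       OP_MACRO_ESCAPE = 0xFE };

/* Hypothetical operand counts, in bytes, for the opcodes above. */
static size_t operand_count(uint8_t op)
{
    switch (op) {
    case OP_PUSH_LIT:
    case OP_PUSH_VAR:
    case OP_STORE_VAR:
        return 1;
    default:
        return 0;
    }
}

/* Rewrites the `len`-byte sequence at `seq` as: escape byte, macro id,
 * then every original operand in order. `out` must hold len + 2 bytes.
 * Returns the number of bytes written. */
size_t encode_macro(const uint8_t *seq, size_t len,
                    uint8_t macro_id, uint8_t *out)
{
    size_t o = 0;
    out[o++] = OP_MACRO_ESCAPE;
    out[o++] = macro_id;
    for (size_t i = 0; i < len; ) {
        size_t nops = operand_count(seq[i++]);   /* drop the opcode */
        assert(i + nops <= len);
        while (nops-- > 0)
            out[o++] = seq[i++];                 /* keep its operands */
    }
    return o;
}
```

With this layout the macro's implementation can read its operands one after another by advancing the program counter, and a second byte allows up to 256 macros per escape opcode.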

Once a process is scanned, macros can be constructed by simply cutting and 
pasting together the binary code produced by the compiler for the threaded code interpreter. 
Macros are just considered as normal bytecodes by the threading dispatcher. 
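The sketch below illustrates, under stated assumptions, how a macro behaves as just another entry of the reference table and how it saves N-1 dispatches. The opcode set, the slot OP_ADD_LIT_TO_VAR standing for a macro installed in an unused opcode, the operand layout and the dispatch counter are all hypothetical; the macro body is written by hand here rather than assembled from compiler output as the text describes. It uses the GCC/Clang "labels as values" extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes; OP_ADD_LIT_TO_VAR stands for a macro
 * installed in an otherwise unused opcode slot. */
enum { OP_PUSH_LIT, OP_PUSH_VAR, OP_ADD, OP_STORE_VAR, OP_HALT,
       OP_ADD_LIT_TO_VAR, OP_COUNT };

/* Executes `code` against `vars`; returns the number of dispatches. */
long run_counting(const uint8_t *code, int *vars)
{
    static void *const opcode_table[OP_COUNT] = {
        [OP_PUSH_LIT]       = &&push_lit,
        [OP_PUSH_VAR]       = &&push_var,
        [OP_ADD]            = &&add,
        [OP_STORE_VAR]      = &&store_var,
        [OP_HALT]           = &&halt,
        [OP_ADD_LIT_TO_VAR] = &&add_lit_to_var,
    };
    int stack[16];
    int sp = 0;
    long dispatches = 0;
    const uint8_t *pc = code;

#define DISPATCH() do { dispatches++; goto *opcode_table[*pc++]; } while (0)

    DISPATCH();

push_lit:
    assert(sp < 16);
    stack[sp++] = (int8_t)*pc++;
    DISPATCH();
push_var:
    assert(sp < 16);
    stack[sp++] = vars[*pc++];
    DISPATCH();
add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
store_var:
    assert(sp >= 1);
    vars[*pc++] = stack[--sp];
    DISPATCH();
add_lit_to_var: {
    /* Macro body: the work of the four original bytecodes with no
     * dispatch in between; operands follow the macro opcode. */
    int lit = (int8_t)pc[0];
    int src = pc[1];
    int dst = pc[2];
    pc += 3;
    vars[dst] = lit + vars[src];
    DISPATCH();
}
halt:
    return dispatches;
#undef DISPATCH
}
```

Running "push literal 5, push variable 0, add, store variable 0, halt" costs five dispatches in this sketch, while the equivalent macro form costs two: the four-bytecode sequence (N = 4) has been reduced to one dispatch, saving N-1 = 3.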

Figure 2 summarizes the preferred embodiment of a virtual machine according
to the invention. The VM is implemented to load programs containing bytecodes to be
interpreted by the VM interpreter. The main steps of the method are the following :

step K0 = EMIT : initialization of the procedure executed by the VM by loading
the programs containing the bytecodes,

step K1 = OPCODE : retrieval of the bytecodes to be interpreted,

step K2 = MACRO : replacement of sequences of simple bytecodes with macro
bytecodes,

step K3 = TRANS : interpretation of the macro bytecodes using the indirect
threaded interpreter method as described in figure 1,

step K4 = RES : get the result, end of the method.
Statistical analysis performed on [...] Java applications [...]
the typical macro length is 4-5 bytecodes, and that after the [...] formation,
macros can be executed up to five times more often than the remaining
[...]. The remaining bytecodes are those for which the implementation is just too [...]
execution cost, it can be significantly reduced by using the invention.

The invention brings out some additional advantages. The processor branch
[...] be reduced by about a factor of five. Since the code to be executed has [...]
of the processor's pipeline and memory subsystem may be [...]
the architecture of the processor for the [...]
significantly [...] architected for the cost of a cache line
fill. On memory [...] dispatching cost essentially depends on [...]
executable code. This would have more or less the same cost as the remaining dispatches.

One of the advantages of macros is that they are generic sequences of
bytecodes, and the probability that one of such sequences can be found elsewhere, in the
context of another process or even [...] process, is quite high. Tests were made for
[...] Java bytecodes. It was found that a significant part of the macros can be reused.
Therefore, [...] for instance, assuming that it would be possible to further cut the cost of
scheduling by a factor of two, the total observable increment in speed would be very small.
Most likely it is not worth trading against the doubling of the memory footprint.
Another advantage of macros is that they do not have any impact on the
normal bytecode dispatching mechanism. There is no need to add another execution
mechanism to those already existing in the VM. There is no need to distinguish between
compiled and non-compiled processes and no need to resort to the weirdness and overhead of
native code interfaces.
Object-oriented languages like Java are characterized by the presence of very
small units of code. Java processes are also very hard to inline, since they are almost always
potentially polymorphic. Therefore, even if a fully optimizing compiler were able to
better map the process execution semantics on the underlying processor architecture, the
overhead of the preamble and conclusion of binary translated processes would often suppress
any advantage.

To improve execution efficiency, a stack caching technique can be used,
which keeps the first three locations of the Java stack inside the processor's register file,
considerably reducing the number of memory accesses. The technique exploits the fact that
the target processor is a stack machine itself. The original bytecode implementations are
substituted with equivalent processor instruction sequences. By using a trivial translation
table and a simple cost function (number of memory references), a very fast and efficient
compilation technique can be achieved. The cost reduction of memory input / output will
now be described, in the case of Java as an example, according to another alternative
embodiment of the invention.

Java is a stack-based language : bytecodes communicate with each other using
memory. Every single bytecode execution implies at least one memory access, which turns
out to be very expensive. Consider, for instance, the following simple expression :

    c = a + b;

In a stack-based language it is translated into :

    push a   : 1 read, 1 write
    push b   : 1 read, 1 write
    add      : 2 reads, 1 write
    store c  : 1 read, 1 write

which represents nine memory access operations. A CPU with a minimum of internal state
can do the same with only three memory accesses. Considering the fact that on a modern
processor architecture, memory references are among the most expensive operations, it is an
ideal field of optimization. With a little additional coding effort, a version of the Java
bytecodes can be made to exchange data through machine registers rather than through
external memory. Macros can then be created, starting from these specialized bytecodes
which are called strands, reducing the number of memory accesses within a macro by more
than a factor of two.
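The count of nine versus three memory accesses for c = a + b can be reproduced with a small instrumented simulation. This is only an accounting sketch: the memory layout (A, B, C, STACK_BASE), the counter structure and both function names are hypothetical, and local C variables stand in for processor registers.

```c
#include <assert.h>

/* Simulated memory with access counters. */
typedef struct { int cells[8]; long reads; long writes; } memory;

static int mem_read(memory *m, int addr)
{
    m->reads++;
    return m->cells[addr];
}

static void mem_write(memory *m, int addr, int value)
{
    m->writes++;
    m->cells[addr] = value;
}

enum { A = 0, B = 1, C = 2, STACK_BASE = 3 };   /* hypothetical layout */

/* c = a + b with the operand stack held in memory. */
void add_via_memory_stack(memory *m)
{
    int sp = STACK_BASE;
    int t;

    t = mem_read(m, A);  mem_write(m, sp++, t);     /* push a : 1 read, 1 write */
    t = mem_read(m, B);  mem_write(m, sp++, t);     /* push b : 1 read, 1 write */

    int y = mem_read(m, --sp);                      /* add : 2 reads, 1 write */
    int x = mem_read(m, --sp);
    mem_write(m, sp++, x + y);

    t = mem_read(m, --sp);  mem_write(m, C, t);     /* store c : 1 read, 1 write */
}

/* Same computation with the stack top kept in locals (registers). */
void add_via_registers(memory *m)
{
    int r0 = mem_read(m, A);
    int r1 = mem_read(m, B);
    mem_write(m, C, r0 + r1);
}
```

In this sketch the memory-stack version performs five reads and four writes, and the register version two reads and one write, which matches the nine and three accesses stated above.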

An implementation of the "macroizer" and of the bytecode "strandifier" would
not need too many lines of code. The partial rewrite of the interpreter's loop can be estimated,
for example, at a few thousand lines of C code. Only a few lines of assembly are necessary for
the implementation of the indirect threaded code dispatcher, and a few hundred are
dedicated to the "strandifier".

Tests and measurements of the running time have been made which do not take into
account the time spent for the bytecode parsing and for the generation of the new macro
bytecodes. Nevertheless, the run-time was measured using a native code profiler. When
running a large application, like a web browser, the total time spent for "macroization"
remains limited to a very small percentage of the total execution time.

An example of a receiver according to the invention is shown in fig. 3. It is a
set top box receiver 20 for interactive video transmission. It comprises a decoder, e.g.
compatible with the MPEG 2 (Moving Pictures Experts Group, ISO/IEC 13818-2)
recommendation, for receiving via a cable transmission channel 23 an encoded signal from a
video transmitter 24 and for decoding the received signal in order to retrieve the transmitted
data to be displayed on a video display 25. The functions of the set top box can be efficiently
software implemented using a system that executes an interpreted language such as Java in
the form of bytecodes. The system comprises a main processor CPU and a memory MEM for
storing software code portions representing instructions for causing the main processor CPU
to carry out the methods according to the invention as described in figure 1 or 2.

According to another embodiment of the invention, the set top box 20 can
receive Java applications containing bytecodes as part of the received signal. In this case, the
set top box would comprise a loader to load the bytecode-based programs received from a
distant sender.
CLAIMS:



1. A method of optimizing interpreted programs in a virtual machine interpreter
of a bytecode-based language, wherein the virtual machine dynamically reconfigures itself by
replacing an original sequence of simple bytecodes with a new sequence of macro bytecodes
and wherein the virtual machine interpreter is coded as a threaded code interpreter for
translating the bytecodes into their implementation code, comprising a reference table which
contains references to the addresses of the implementation of the bytecodes in order that
during translation of the current bytecode, the address of the implementation of the next
bytecode is retrieved to be able to jump to the next bytecode.

2. A method according to claim 1, wherein the bytecodes of the original
sequence are grouped after the new sequence of said macro operation codes.

3. A method according to any of claims 1 or 2, wherein the virtual machine
interpreter comprises a predetermined set of bytecodes, some of which are unused, and
wherein said new sequence of macro operation codes is implemented by exploiting said
unused bytecodes.

4. A method according to claim 3, wherein the unused bytecodes are encoded
with at least a two-byte representation.

5. A method of optimizing interpreted programs, in a virtual machine for a
bytecode-based language, comprising the following steps :

initialization by loading programs containing the bytecodes,
replacement of sequences of simple bytecodes with macro codes,
interpretation of the macro bytecodes using an indirect threaded interpreter for
translating the bytecodes into their implementation code, comprising a reference table which
contains references to the addresses of the implementation of the bytecodes in order that
during interpretation of the current bytecode, the address of the implementation of the next
bytecode is retrieved to be able to jump to the next bytecode.
6. A computer program product for being loaded into a memory, comprising a set
of functions for causing a processor to carry out the method according to any one of claims
1 to 5.

7. A receiver for receiving transmission signals, the receiver comprising a
processor (CPU) and a memory (MEM) for storing software code portions representing
instructions for causing the processor to carry out the method according to any one of claims
1 to 5.

8. A method of making available for downloading a computer program
comprising instructions for executing the method as claimed in any one of the claims 1 to 5,
into a receiver as claimed in claim 7.